EJHG Open 



European Journal of Human Genetics (2012) 20, 469-475 

© 2012 Macmillan Publishers Limited All rights reserved 1018-4813/12 



www.nature.com/ejhg 



ARTICLE 



A statistical method for region-based meta-analysis 
of genome-wide association studies in genetically 
diverse populations 

Xu Wang 1 , Xuanyao Liu 2 , Xueling Sim 3 , Haiyan Xu 3 , Chiea-Chuen Khor 4 , Rick Twee-Hee Ong 2,3 , 

Wan-Ting Tay 5 , Chen Suo 3 , Wan-Ting Poh 1 , Daniel Peng-Keat Ng 1 , Jianjun Liu 4 , Tin Aung 5,6 , Kee-Seng Chia 1,2,3 , 

Tien-Yin Wong 5 ' 6 ' 7 ' 8 , E-Shyong Tai 1 ' 8 and Yik-Ying Teo*' 1 ' 2 ' 3 ' 4 ' 9 

Genome-wide association studies (GWAS) have become the preferred experimental design in exploring the genetic etiology of 
complex human traits and diseases. Standard SNP-based meta-analytic approaches have been utilized to integrate the results 
from multiple experiments. This fundamentally assumes that the patterns of linkage disequilibrium (LD) between the underlying 
causal variants and the directly genotyped SNPs are similar across the populations for the same SNPs to emerge with surrogate 
evidence of disease association. We introduce a novel strategy for assessing regional evidence of phenotypic association that 
explicitly incorporates the extent of LD in the region. This provides a natural framework for combining evidence from multi- 
ethnic studies of both dichotomous and quantitative traits that (i) accommodates different patterns of LD, (ii) integrates 
different genotyping platforms and (iii) allows for the presence of allelic heterogeneity between the populations. Our method 
can also be generalized to perform gene-based or pathway-based analyses. Applying this method on real GWAS data in type 2 
diabetes (T2D) boosted the association evidence in regions well-established for T2D etiology in three diverse South-East Asian 
populations, as well as identified two novel gene regions and a biologically convincing pathway that are subsequently validated 
with data from the Wellcome Trust Case Control Consortium. 
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INTRODUCTION 

Remarkable achievements have been made in large-scale genetic 
studies of common diseases and complex traits. 1 ' 2 The identification 
of variants in the human genome that are convincingly associated 
with different phenotypes has mainly been carried out in individuals 
of European descent, although increasingly studies involving non- 
Caucasian samples from diverse population groups have been pub- 
lished or are currently being conducted. Genome-wide meta-analyses 
(GWMA) involving tens of thousands of samples have extended 
the success in allowing novel variants with smaller effect sizes to be 
discovered. 3-7 Despite these triumphs, these findings really account 
for only a small fraction of the total disease heritability, 8 suggesting 
undiscovered genetic mechanisms may be responsible or alternative 
methods to analyze these data may be necessary to address the missing 
heritability. 

Current implementation of GWMA requires the same SNPs to 
display consistent evidence of phenotypic association across multiple 
populations. This implicitly assumes that across these populations, (i) 
the same causal variant is present; (ii) the linkage disequilibrium (LD) 
pattern between the causal variant and the assayed SNPs is similar and 



(iii) the effect sizes observed at the assayed SNPs are consistent. 9 ' 10 
Random effects methods for combining data across studies do not 
utilize information from neighboring SNPs that may present concur- 
ring evidence of disease association in different studies, and often have 
the tendency to weaken association signals. 9 SNP-based meta-analyses 
also require the same SNPs to be genotyped in all the populations, 
although this requirement can apparently be addressed by imputation 
strategies that effectively standardize the SNP content across different 
studies 11-13 (Figure 1). However, imputation does not always present 
an effective solution, particularly in the absence of appropriate 
population reference panels. 14 ' 15 

Before recent whole- genome sequencing endeavors, SNP discoveries 
were predominantly made in populations of European ancestry. 16 This 
strong ascertainment bias has inadvertently skewed the SNPs surveyed 
in the International Hap-Map Project, 17 which consequently preju- 
diced the SNP content of commercial genotyping platforms to carry 
tagging SNPs that are liable to exhibit higher minor allele frequencies 
in European populations. 18 This means current genotyping arrays may 
be less optimal for non-European populations, resulting in attenuated 
association signals due to lower allele frequencies and weaker LD. 14 ' 15 
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Figure 1 Illustration of the three scenarios in a meta-analysis, where the genotyped SNPs may be in different degree of LD with the unobserved causal 
variant (star): (i) the ideal situation where the same SNPs are genotyped in two studies, and the LD between them and the functional variant is identical in 
both populations (black arrow); (ii) a realistic situation where the same markers are genotyped in two studies, but different LD patterns exist between them 
and the functional variant (green arrow); (iii) realistic situation where different markers are genotyped in two studies, and cannot be meta-analysed without 
resorting to imputation (pink arrow). The LD between the causal variant and each SNP is represented in different color intensity ranging from white (low LD) 
to red (high LD). 



The pursuit of evidence stronger than the genome-wide significance 
is thus more challenging, and larger sample sizes in non-European 
studies and meta-analyses of ethnically mixed populations are 
required to compensate for variations in LD patterns from European 
populations. 14 ' 19 Instead of seeking individual variants that display 
convincing evidence of phenotypic association across multiple popu- 
lations, a more realistic scenario is perhaps to look for genomic 
regions with consistent clustering of SNPs exhibiting moderate signals 
in these populations. 

In this paper, we propose a novel paradigm for interrogating genetic 
data for disease association given either a dichotomous or a quanti- 
tative outcome. Our method works by quantifying the degree of over- 
representation of associated SNPs in a pre-defined genomic region, 
given a specific definition of statistical significance. Through an eigen- 
decomposition of the matrix measuring the LD between every possible 
pair of SNPs in the region, the effective number of independent SNPs 
as well as the number of independent SNPs exhibiting evidence of 
phenotypic association can be evaluated (Supplementary Figure SI). 
The regional evidence of phenotypic association is thus quantified as 
the extent of over- representation of independent associated SNPs 
against the effective total number of independent SNPs in the region. 
This approach can be applied in a genome-wide fashion by consider- 
ing moving windows of a fixed length within a population. In 
addition, this presents a natural framework for integrating the results 
from multiple studies in a region-based genome-wide meta-analysis, 
where we can sum up the number of independent signals and 
independent SNPs in each region across the different studies, and to 
calculate a single regional P- value for this meta-analysis by quantifying 
the joint extent of over-representation. This framework also allows a 
straightforward extension to consider evidence across genes and 
biological pathways. 



METHODS 

Region-based analysis 

Our region-based meta-analysis approach relies on the principle that when L* 
independent hypotheses are tested at a statistical significance threshold of a% 
(P cr it) } on average we expect aL*/100 of these hypotheses to display statistical 
evidence more significant than oc% by chance. In the application within a 
genome -wide association study (GWAS), suppose there are 100 SNPs in a 
particular genomic window of 250 kb and the threshold for defining statistical 
significance has been set at 1%. Under the null hypothesis that none of the 100 
SNPs are associated with the phenotype, we expect one SNP on average to 
exhibit a P- value that is <0.01 if the 100 SNPs are mutually independent. An 
over- representation of independent SNPs with P- value <0.01 in this genomic 
region thus corresponds to evidence that suggests the region is associated with 
the phenotype. However, the presence of LD implies the assumption of 
independence between the SNPs is unlikely to be valid. 

In order to evaluate the effective number of 'independent' SNPs in each 
genomic region, we perform an eigen-decomposition of the LxL symmetric 
correlation matrix M between the L SNPs with entry rriy denoting the LD in 
directional r 2 between the ith and the jth SNP, where the direction is 
determined by the sign of D'. Here we assume the minor allele frequencies 
of all L SNPs are at least 1%. The resulting eigenvectors effectively represent 
mutually independent contributions in explaining the variance in the correla- 
tion matrix, and each eigenvector is given as a linear combination of SNPs that 
are in at least some degree of LD. The SNP loadings of each eigenvector 
measure the extent each SNP contributes to the eigenvector, and the relative 
loadings between the SNPs for each eigenvector provide a surrogate for the 
degree of correlation between the SNPs. The L eigenvectors thus represent 
independent sources of information from all the SNPs in the window, and 
the number of eigenvectors N tota \ that cumulatively accounts for t% of 
the variance can be determined as argmin; ]T i=1 %>TL%, for 1</<L 
and where X{ represents the eigenvalue corresponding to the ith eigenvector 
e ? . Let w denote a vector of length L with the w\ entry corresponding to one if 
the observed P- value for the ith SNP is <P cr i t , and zero otherwise. Suppose 
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e{j denote the jth component in the ith eigenvector, then the corresponding 
component in the ith scaled eigenvector represented by e' 2 - is \eij\/J2j=i L \ e ij\- The 
effective number of independent SNPs that exhibit P- value <P cr i t is thus 
calculated as 

Ntotd 
i=l 

The regional evidence for the extent of over-representation of SNPs with 
P- values <P cr it is calculated as the upper- tailed P- value of the exact Binomial 
test for observing N^t out of N tota \ SNPs when the success probability is given 
as P cr i t . However, as N^ it can be a non- integer, we estimate the P- value associ- 
ated with JVhit by linear interpolating between the P- values obtained for the 
floor and ceiling integer values of N^ ip or equivalently -Pfl 0 or+(-Pceiling — -Pfloor) x 
(Nhit - L-^hitJ)' where Pfi oor and P ce mng denote the P- values associated with 
L^hit J (the floor integer value of N^ it ) and [N^] (the ceiling integer value 
of Nhit), respectively. Our simulation study suggests this procedure is con- 
servative and tends to underestimate the evidence (Supplementary Figure S2), 
thus the true evidence is likely to be more significant than the resultant linear- 
interpolated P- value. 

A region-based meta-analysis across K independent populations can be 
performed by calculating the corresponding and N tot ^ from each 

population k in the same genomic window. The cumulative evidence across the 
K populations will then be quantified by the upper tailed P- value of the exact 
Binomial test for observing J2k Nhit^Wt ofj^j- N ioia \^ SNPs when the success 
probability is given as P cr [ t . 

Type 2 diabetes (T2D) data sets 

We applied our region-based meta-analysis approach to combine the evidence 
from three separate genome-wide surveys of type 2 diabetes in the Chinese, 
South-East Asian Malays and Asian Indians from Singapore. Results from each 
individual survey and the SNP-based meta-analysis have been reported else- 
where. 20 Briefly, the Chinese GWAS examined 2010 cases and 1945 controls 
(post-QC) that were typed on a mixture of Alumina (San Diego, CA, USA) 610 
(1082 cases/1006 controls) and IlluminalM arrays (928 cases/939 controls). 
The corresponding numbers for the Malay and Indian GWAS were 794 cases/ 
1240 controls and 977 cases/1169 controls, and these were all genotyped on the 
Illumina 610 arrays. A genome -wide region-based meta-analysis was first 
performed between the Chinese data that were genotyped on the two arrays 
to yield a single set of findings for the Chinese experiment. The three 
experiments for the different population groups were used as discovery cohorts 
for a region-based meta-analysis with a window size of 250 kb and a sliding gap 
of 50 kb such that two consecutive windows have a 200-kb overlap. We also 
performed a gene -based meta-analysis across 30 037 genes identified from the 
hgl8 version of the TransMap UCSC gene mapping, with each window 
spanning a 100-kb flanking buffer from the start and end coordinates of 
each gene. A pathway analysis was also performed for 212 pathways in the 
KEGG database 21-23 (http://www.genome.jp/kegg/pathway.html). Each gene 
(inclusive of a 25-kb flanking buffer) in a particular pathway was considered 
as a distinct window, except for genes within 50 kb of each other, which we 
merged as one discrete window. The intra-population evidence for a pathway 
was calculated from the summation of the effective number of independent 
significant and total SNPs across the windows. The P- value threshold (P C rit) was 
set at 0.01. We identified any genomic region that exhibited P- value < 0.001 in 
at least two populations from the region-based and gene-based analyses. This is 
an additional criteria to ensure that at least two populations are contributing 
to the observed signals, given the fundamental strategy of our approach is to 
identify genomic regions that are associated with the outcome in multiple 
populations. We excluded any regions that are known to carry copy number 
changes as estimation of LD is likely to be inaccurate in these regions. For the 
pathway analysis, we identified a pathway that exhibited P- value <0.05 in at 
least two populations. To avoid artificial signals of disease association that were 
the results of erroneous genotype calling, genotyping quality was visually 
ascertained in each cohort for every SNP located in the discovered regions 
from the region-based analysis. To validate the findings, similar analyses 
were performed on the type 2 diabetes data from Phase 1 of the Wellcome 



Trust Case Control Consortium (WTCC). 19 Calculation of LD in each of 
the discovery and validation cohorts was performed with 500 control samples 
from the respective study. 

Software implementation 

The method described in this paper is implemented in three separate C++ 
programs: (i) regionalP for performing genome- wide region-based analysis; (ii) 
regionalP-gene for performing gene-based analysis; and (iii) regionalP-pathway 
for performing pathway-based analysis. The programs are available from 
http: // www. statgen. nus . edu. sg/~ software/regionalP.html. 

Descriptions of the set up for the simulations, along with 
additional methods and analyses are available in the Supplementary Material 
online. 

RESULTS 

Power and false-positive rates 

We compared our method for regional analysis against standard SNP- 
based analyses with (i) only the genotyped SNPs or with (ii) the full 
set of SNPs after imputing against reference panels from phase 2 of the 
HapMap (HapMap2). In the meta-analysis combining the results 
from all three populations, the power of the region-based strategy 
was similar to that from a meta-analysis of the imputed SNPs 
(Figure 2). This was significantly higher than the power from the 
meta-analysis of only the genotyped SNPs. The false -positive rates of 
all three meta-analytic approaches were <5%, although the region- 
based approach had a near- zero false-positive rate when we imposed 
an additional restriction requiring at least two populations to exhibit 
P- values of < 0.001 in the same region (Supplementary Table SI 
online). At a genome- wide significance of 10~ 8 , this additional 
restriction resulted in only a marginal decrease in statistical power, 
although this decrease was more substantial at less stringent signifi- 
cance thresholds. Investigating the sensitivity of our method by the 
allelic spectrum of the simulated causal variants in CEU, we observed 
the region-based approach was less powerful in identifying low- 
frequency causal variants (MAF of causal variant <5%) but was 
marginally more powerful for common causal variants (MAF of causal 
variant >5%, see Figure 2). 

We also explored the performance of the three approaches in the 
presence of allelic heterogeneity, defined as having different causal 
variants in the same gene or genomic location across different 
populations. Specifically, we performed another series of simulations 
assuming two different causal variants in CEU and JPT+CHB, 
while allowing YRI to carry either of the two possible causal 
variants. Our simulation explicitly selected causal variants that are 
at least 20 kb away but within 50 kb of each other. The region-based 
approach significantly outperformed both SNP-based approaches 
in the meta-analyses across CEU and JPT+CHB, particularly at 
lower Type I errors and when LD between the two causal variants 
is low (Figure 3). When the LD between the two causal variants is 
high (r 2 >0.8), there is almost no difference in the results of the 
SNP-based meta-analyses of all three populations at higher Type I 
errors as compared with the power observed in our earlier simula- 
tions with only one causal variant. This is reassuring since we 
expect the two causal variants to behave as effectively a single 
variant when the LD is high. However, the low power experienced 
by the SNP-based methods in the presence of two separate causal 
variants reflects the inadequacy of SNP-based approaches for integrat- 
ing data across diverse populations, and the greatest merit of the 
region-based approach is in the presence of allelic heterogeneity 
between populations where the different causal variants are in weak 
and non-existent LD. 
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Figure 2 Power comparisons of the different methods for the meta-analysis across all three Hapmap populations. Simulations were performed with HAPGEN 
(Wellcome Trust Centre for Human Genetics, Oxford, UK) assuming a causal variant that was present in all HapMap phase 2 panels with a multiplicative allelic 
relative risk of 1.5. The case-control genotype data were subsequently thinned to the SNP content of Affymetrix 500K (CEU simulations), lllumina 1M (JPT+CHB 
simulations) and Affymetrix 6.0 (Santa Clara, CA, USA) (YRI simulations). We calculate the power when only the genotyped SNPs were considered (green triangles), 
and when we performed region-based analyses of 100 kb regions in each of the three populations (red circles). Imputation was performed with population-specific 
haplotypes to recover the SNPs removed from the thinning (except for the causal SNP), and a SNP-based analysis was performed on this denser set of imputed and 
genotyped SNPs (blue diamonds). The SNP-based meta-analyses considered either the genotyped SNPs present across all three platforms only (green triangles) or 
across the denser set of imputed and genotyped SNPs common to all three populations (blue diamonds). The region-based meta-analysis was performed without 
restriction (red circles), and with the restriction that at least two populations display region-based P-value <0.001 (red open circles). 
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Figure 3 Power comparisons of the different methods for meta-analysis in the presence of allelic heterogeneity. A different causal variant was selected in CEU 
and JPT+CHB, respectively, while either of the two causal variants was equally likely to be present in the YRI simulations. The two causal variants are located 
at least 20 kb away but are not >50kb apart, and have minor allele frequencies of at least 10% in all three HapMap populations. The case-control genotype 
data simulated from HAPGEN were subsequently thinned to the SNP content of Affymetrix 500K (CEU), lllumina 1M (JPT+CHB) and Affymetrix 6.0 (YRI). 
We calculated the power when only the CEU and JPT+CHB populations were combined (top row), and when all three HapMap panels were combined (bottom 
row), investigating the performance of the meta-analysis across the SNPs on all three arrays (green triangles), and for the region-based meta-analysis considering 
250 kb regions (red circles). Imputation was performed with population-specific haplotypes to recover the SNPs removed from the thinning, and a SNP-based 
meta-analysis was performed on this denser set of imputed and genotyped SNPs common to all three populations (blue diamonds). We binned the 3000 pairs 
of causal variants according to the LD between the two SNPs into four groups: (i) O^/^O.l; (ii) 0.1 </ 2 <0.3; (iii) 0.3 <a 2 <0.8; (iv) 0.8 </ 2 <l. 



Application to T2D data 

We applied our method to perform region-based, gene-based and 
pathway-based meta-analyses in three independent genome- wide 



studies of type 2 diabetes (T2D) involving the Chinese, Malays and 
Asian Indians in Singapore. This was performed across all the 
autosomal chromosomes within each of the three GWAS in a 
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hypothesis- generating fashion, where for the region-based analyses we 
considered sliding windows of 250 kb each with a sliding distance of 
50 kb such that every pair of consecutive windows overlapped by 
200 kb. About half of the Chinese samples were genotyped on the 
Illumina 1M array, while the remaining half of the Chinese, Malay and 
Indian samples were genotyped on the Illumina 610 array. Results of 
the SNP-based meta-analyses using both the genotyped SNPs and the 
imputed SNPs have been reported elsewhere. 20 Briefly, none of the 
SNPs achieved genome- wide significance in the meta-analyses, 
although variants in CDKAL1 and HHEX/IDE/KIF11 displayed mod- 
erate evidence of T2D association in at least two of the three 
populations. In particular, variants in CDKAL1 were found against a 
genomic background exhibiting substantial LD variations between the 
populations. 24 

The genome-wide meta-analysis with our region-based method 
identified five regions exhibiting P< 0.001 in at least two of the 
three populations (Table 1). Other than the region on chromosome 
6 that encompassed CDKAL1, the other four regions did not emerge 
in the SNP-based meta-analyses 20 (Supplementary Table S2 online). In 
the replication experiment with the WTCCC data, two of these five 
regions displayed strong evidence of regional association (P< 10~ 4 ) in 
the case-control T2D GWAS, which included the stretch on chromo- 
some 6 encompassing CDKAL1 and the region on chromosome 3 
between 21.73 and 22.13 Mb that encompassed ZNF659. Suggestive 
corroborative evidences (P<0.05) from WTCCC were also seen in the 
region on chromosome 2 that spanned the STK39 gene and the region 
on chromosome 14 containing the genes GNG2 and NID2. There was 
no evidence of regional association in the WTCCC for the remaining 
region on chromosome 20 spanning STX16 and NPEPL1. 



Remarkably, all five regions have been previously implicated in 
diabetes, obesity or other cardiovascular biomarkers. The convincing 
signal for the region encompassing CDKAL1 is consistent with 
established findings for T2D, 25-30 while ZNF659 has been associated 
with young-onset type 2 diabetes in the American Indians. 31 The 
STK39 gene has been consistently reported to harbor variants impli- 
cated in hypertension and in obesity and diabetes-related rodent 
quantitative trait loci. 32 Previous pathway analysis has identified the 
G-protein GNG2 to be associated with type 1 diabetes, 33 suggesting a 
serotonin modulating mechanism that is similarly relevant in the 
etiology of type 2 diabetes. Variants in STX16 have also been reported 
to significantly slow the reversal of insulin-stimulated glucose trans- 
port, 34 ' 35 a biological mechanism that is highly relevant to T2D. 

DISCUSSION 

The scale of GWMA with diverse European and non-European 
populations is expected to increase markedly given the popularity of 
genome-wide designs in studying the genetic etiology of common 
diseases and complex traits. This, however, increases the challenge of 
accommodating varying patterns of LD that may exist between 
genetically diverse populations, which can compromise the ability to 
reproduce the association signals from surrogate markers that are 
correlated to the unobserved functional polymorphisms. We have 
introduced an alternative strategy for combining the evidence across 
different populations that is robust to dissimilar patterns of LD 
surrounding a bona fide association signal. The approach is applicable 
to both case-control studies or in association studies of quantitative 
traits. Our method has also been shown to perform comparably to 
imputation-based meta-analysis, except it relies on available genotype 



Table 1 Results of the region-based meta-analysis for type 2 diabetes 
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Genomic regions identified by the region-based analysis, with the discovery mechanism based on three genome-wide association studies conducted in Chinese, Malays and Asian Indians in 

Singapore. Validation of the regions that emerged was performed on the type 2 diabetes case-control study from Phase 1 of the Wellcome Trust Case-Control Consortium (WTCCC). 

a The start and end positions of the genomic region containing consecutive windows with P<0.001 in at least two of the populations (in bold). The start and end positions of the top 250 kb window 

are shown in brackets. Subsequent columns show the evidence for the discovery populations in the top window. 

b The three discovery populations abbreviated: C, SP2 Chinese; M, SiMES Malays; I, SINDI Indians. 

Effective number of independent SNPs with P<0.01 after accounting for LD. 

Effective number of independent SNPs across the region after accounting for LD. 

e The start and end positions of the genomic region containing consecutive windows with evidence of validation (defined as P<0.05), with the start and end positions of the top 250 kb being shown 
in brackets. Subsequent columns show the evidence for WTCCC1 in the top window. For regions without any 250 kb windows displaying P<0.05, the best window in that region is shown instead. 
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information from the experiment without requiring additional refer- 
ence data from appropriately matched populations. In the presence of 
allelic heterogeneity, our approach outperforms both SNP-based 
approaches using either genotyped or imputed SNPs. The application 
of the region-based method to three genome-wide surveys in T2D 
resulted in the discovery of novel and established regions that are 
subsequently validated with data from the WTCCC. 

The region-based approach relies on the elegant application of the 
concept of statistical significance in evaluating a genomic region for 
evidence of trait association. For example, under the null hypothesis 
that the region is independent of the phenotype, we expect 5% of the 
SNPs to be statistically significant by chance when adopting a P- value 
threshold of 5%, if indeed all the SNPs in this region are mutually 
independent. If this assumption of mutual independence is true, an 
over- representation of statistically significant SNPs in this region 
constitutes evidence that this region is associated with the phenotype, 
with the extent of over- representation indicating the strength of the 
evidence. This is analogous to the use of 5x 10~ 8 as the definition of 
genome -wide significance for assessing the likely authenticity of single 
markers. LD between the SNPs can confound the measurement of 
over- representation, as this can either inflate the number of significant 
signals, which increases false positives, or produce an inflated estimate 
of the total number of SNPs, which decreases statistical power. The 
eigen-decomposition of the LD matrix allows the effective number of 
independent SNPs to be estimated and consequently, also the effective 
number of independent association signals that are statistically sig- 
nificant. By surveying the same genomic region across different 
independent populations, the same statistical framework can be 
extended to consolidate the evidence from multiple populations, 
simply by summing the effective number of independent SNPs and 
signals across these populations and assessing the evidence for an over- 
representation of significant signals. This provides a simple but, yet, 
effective solution to combining the results from experiments that use 
different genotyping platforms. By searching for the same regions 
rather than the same SNPs to emerge in the different GWAS, inter- 
population variation in LD patterns between the assayed SNPs and the 
causal variant is expected to have lesser impact on the sensitivity of 
our approach. 

One feature of our method is the ability to sharpen the association 
evidence in regions containing multiple weak signals across different 
ethnic groups. These signals may be weaker as a result of SNP 
ascertainment biases in the design of genotyping arrays, resulting in 
weaker LD between the assayed SNPs and the causal variants. The 
current definition of genome- wide significance excludes many poten- 
tial signals to be considered in a bid to protect against the abundance 
of false discoveries that is associated with testing in excess of a million 
hypotheses. This poses a significant challenge to genome-wide studies 
and GWMA in populations with short LD, such as African popula- 
tions, 15 ' 36 as it is less likely for variants to be in sufficient LD to exhibit 
statistical evidence stronger than the stringent threshold. Furthermore, 
the greater genetic diversity that is common of such populations means 
it is not immediately straightforward to compensate for the lower 
LD by increasing the effective sample size through a meta-analysis 
of several populations. Our method thus provides a viable solution 
within a sound statistical framework to exploit and combine the 
evidence from SNPs that are weakly associated with the phenotype. 

The application of analytical methods that investigate regions in the 
genome rather than relying on individual SNPs is not a new concept. 
Neither is implementing a statistical strategy to estimate the effective 
number of independent association tests in the presence of LD. 
Numerous approaches have in fact been introduced to address the 



issue of multiple testing in the presence of correlated SNPs. 37-43 
However, these methods either assign the most significant SNP- 
based evidence as the statistical evidence for the set of loci, 39 or do 
not explicitly incorporate the association evidence in adjusting for the 
effective number of tests. 38 ' 40-43 A recent region-based approach 
adopted a more sophisticated approach that borrows information 
from surrounding SNPs, although it tends to rely on heuristic 
measures such as the proximity to specific genomic features (eg, 
known genes, evolutionarily conserved regions and haplotype blocks) 
for defining SNP clusters. 44 In our opinion, the imputation frame- 
works that MACH 12 and IMPUTE 13 are built on provide a more 
natural way to incorporate information from surrounding SNPs 
without relying on pre-defined features that may not adequately 
account for the correlation between SNPs. We thus benchmarked 
our method against the performance of the imputation-based 
approach, which has become the strategy of choice in recent gen- 
ome-wide studies. More importantly, neither of the previous region- 
based approaches provide a natural solution to integrate the evidence 
across multiple genome-wide studies in a meta-analysis, nor ade- 
quately manage the complexity due to allelic heterogeneity. 

We have proposed a novel and powerful strategy for querying the 
genome for genotype-phenotype associations that realistically man- 
ages the challenges imposed by the fundamental design of genome- 
wide studies and in combining several such studies from diverse 
populations. We envisage this approach has the potential to be further 
developed for burden-related tests of rare or low-frequency variants 
across multiple heterogeneous populations, which is an emerging issue 
given the increasing popularity of exome-sequencing experiments 
across numerous traits. 
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